58 research outputs found

    Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

    We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

    Semantic lexicon adaptation for use in query interpretation

    We describe improvements to the use of semantic lexicons by a state-of-the-art query interpretation system powering a major search engine. We successfully compute concept label importance information for lexicon strings; lexicon augmentation with such information leads to a 6.4% precision increase on affected queries with no query coverage loss. Finally, lexicon filtering based on label importance leads to a 13% precision increase, but at the expense of query coverage.

    Wikum: Bridging Discussion Forums and Wikis Using Recursive Summarization

    Large-scale discussions between many participants abound on the internet today, on topics ranging from political arguments to group coordination. But as these discussions grow to tens of thousands of posts, they become ever more difficult for a reader to digest. In this article, we describe a workflow called recursive summarization, implemented in our Wikum prototype, that enables a large population of readers or editors to work in small doses to refine out the main points of the discussion. More than just a single summary, our workflow produces a summary tree that enables a reader to explore distinct subtopics at multiple levels of detail based on their interests. We describe lab evaluations showing that (i) Wikum can be used more effectively than a control to quickly construct a summary tree and (ii) the summary tree is more effective than the original discussion in helping readers identify and explore the main topics.

    Multiple ranking strategies for opinion retrieval in blogs

    We describe our participation in the Opinion Retrieval task at TREC 2006. Our approach to identifying opinions in blog posts consisted of scoring the posts separately on various aspects associated with an expression of opinion about a topic, including shallow sentiment analysis, spam detection, and link-based authority estimation. The separate approaches were combined into a single ranking, yielding significant improvement over a content-only baseline.

    Using blog properties to improve retrieval

    This paper describes three simple heuristics which improve opinion retrieval effectiveness by using blog-specific properties. Blog timestamps are used to increase the retrieval scores of blog posts published near the time of a significant event related to a query; an inexpensive approach to comment amount estimation is used to identify the level of opinion expressed in a post; and query-specific weights are used to change the importance of spam filtering for different types of queries. Overall, these methods, combined with non-blog-specific retrieval approaches, result in substantial improvements over state-of-the-art.

    Miscellaneous General Terms Languages, Management

    We describe a system for automating call-center analysis and monitoring. Our system integrates transcription of incoming calls with analysis of their content; for the analysis, we introduce a novel method of estimating the domain-specific importance of conversation fragments, based on divergence of corpus statistics. Combining this method with Information Retrieval approaches, we provide knowledge-mining tools both for the call-center agents and for administrators of the center.

    Leave a Reply: An Analysis of Weblog Comments

    Access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. This overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. In this paper we present a large-scale study of weblog comments and their relation to the posts. Using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access.